Quality of White Wines by Alan Gou

Preliminary Exploration

Before I begin plotting the data, I want to first figure out a couple of things about the variables. First, how many wines are there at each quality rating?

summary(wf)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
table(wf$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Dataset Information

There are 4898 observations of 12 variables. Each observation (or row) consists of 11 variables describing various chemical and physical properties of a wine, plus the median of the ratings given to that wine by judges, with 0 being the lowest rating and 10 the highest.

Goal of analyzing this dataset

Quality is the feature of interest - the goal of this analysis is to explore which other features of the data explain the quality of the wines.

Some preliminary expectations

From what I have read in the readme for the dataset, I am expecting levels of sulfur dioxide to play a part in determining quality - it seems like there ought to be a balance of sulfur dioxide. Too much will cause a bad, sulfurous odor, while too little may make the wine not fresh. Beyond that, my non-existent knowledge of wine would have me expect that sugar levels, alcohol content, and salt content would all have some sort of effect on quality, though in what way I really have no idea at this point. I also expect acidity to be a factor in determining quality.

Creating new ‘quality_level’ bucket

As you can see, there are no wines with ratings of 0, 1, 2, or 10. There are only 5 wines with ratings of 9 and 20 with ratings of 3. This seems like a good indication that I can group some of these ratings together into buckets: “high”, “medium-high”, “medium”, “medium-low”, and “low”. I’ll do this by adding a new variable: quality_level. This will let me use geom_freqpoly and facet_wrap more effectively, since I won’t have one category with only 5 observations in it and another category with over 2000. Low is 3 and 4, medium-low is 5, medium is 6, medium-high is 7, and high is 8 and 9.

This distribution is somewhat normal, though there are several hundred more medium-low wines than medium-high wines. Still, I think this will serve as a suitable replacement for quality in terms of plotting.
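The bucketing described above can be sketched with base R's cut(). This is only one possible way to build quality_level, not necessarily the code used for this report. The data frame wf_toy is a hypothetical stand-in that contains just a quality column, with counts copied from the table(wf$quality) output earlier.

```r
# Hypothetical stand-in for wf: only the quality column, with the same
# counts per rating as the table(wf$quality) output shown earlier.
wf_toy <- data.frame(
  quality = rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))
)

# Bucket the ratings: 3-4 low, 5 medium-low, 6 medium, 7 medium-high, 8-9 high.
# cut() uses right-closed intervals by default, so (2,4] covers ratings 3 and 4.
wf_toy$quality_level <- cut(
  wf_toy$quality,
  breaks = c(2, 4, 5, 6, 7, 9),
  labels = c("low", "medium-low", "medium", "medium-high", "high"),
  ordered_result = TRUE
)

# Number of wines in each bucket.
table(wf_toy$quality_level)
```

Making the factor ordered keeps the levels in low-to-high order in tables, facets, and legends.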

Deeper Exploration - Investigating Correlations Between Features

A bit of grouping

## Source: local data frame [5 x 12]
## 
##   quality_level median_fixed_acidity median_volatile_acidity
## 1           low                  6.9                    0.32
## 2    medium-low                  6.8                    0.28
## 3        medium                  6.8                    0.25
## 4   medium-high                  6.7                    0.25
## 5          high                  6.8                    0.26
## Variables not shown: median_citric_acid (dbl), median_total_acidity (dbl),
##   median_alcohol (dbl), median_sugar (dbl), median_ph (dbl),
##   median_chlorides (dbl), median_total_so2 (dbl), median_free_so2 (dbl),
##   median_sulphates (dbl)

Here, I have done some aggregations on the various features. I have also created a new variable called total acidity, which is the sum of citric, fixed, and volatile acidity.
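A rough sketch of this kind of aggregation, using base R's aggregate() rather than whatever grouping code produced the table above. The data frame wf_toy and all of its values are synthetic and only mimic the column names; the numbers it prints are not results from the wine data.

```r
# Synthetic stand-in for wf (all values are made up for illustration).
set.seed(1)
n <- 200
lvls <- c("low", "medium-low", "medium", "medium-high", "high")
wf_toy <- data.frame(
  fixed.acidity    = round(rnorm(n, mean = 6.9, sd = 0.8), 1),
  volatile.acidity = round(runif(n, min = 0.1, max = 0.6), 2),
  citric.acid      = round(runif(n, min = 0.0, max = 0.7), 2),
  quality_level    = factor(sample(lvls, n, replace = TRUE), levels = lvls)
)

# total.acidity as described above: the sum of the three acidity measures.
wf_toy$total.acidity <- with(wf_toy, fixed.acidity + volatile.acidity + citric.acid)

# Median of each acidity feature within each quality level.
medians <- aggregate(
  cbind(fixed.acidity, volatile.acidity, citric.acid, total.acidity) ~ quality_level,
  data = wf_toy,
  FUN = median
)
medians
```

Medians are used instead of means here because several features in the summary above have long right tails.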

Investigating Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wf$fixed.acidity and wf$volatile.acidity
## t = -1.5886, df = 4896, p-value = 0.1122
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.050671536  0.005312543
## sample estimates:
##         cor 
## -0.02269729

## 
##  Pearson's product-moment correlation
## 
## data:  wf$citric.acid and wf$fixed.acidity
## t = 21.137, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2633067 0.3146389
## sample estimates:
##       cor 
## 0.2891807
## 
##  Pearson's product-moment correlation
## 
## data:  wf$citric.acid and wf$volatile.acidity
## t = -10.578, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1767384 -0.1219760
## sample estimates:
##        cor 
## -0.1494718

## 
##  Pearson's product-moment correlation
## 
## data:  wf$total.acidity and wf$pH
## t = -33.388, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4531918 -0.4075605
## sample estimates:
##        cor 
## -0.4306513

The correlation between fixed and volatile acidity is very small (-0.02) and not statistically significant, but there is a correlation between citric acid and fixed acidity of 0.29. This is positive, unlike the correlation between citric acid and volatile acidity of -0.15. I am not sure why that is, but it seems interesting that more citric acid is associated with higher fixed acidity but lower volatile acidity.

As expected, total acidity is negatively correlated with pH - more acids (obviously) mean lower pH.
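The outputs above have the shape produced by cor.test(). A minimal sketch of such a call follows. The two vectors are synthetic, built so that pH falls as acidity rises, and they are not the wine data, so the printed numbers will differ from those above.

```r
set.seed(42)
n <- 500

# Synthetic data: pH is constructed to decrease with total acidity, plus noise.
total.acidity <- rnorm(n, mean = 7.5, sd = 0.9)
pH <- 3.9 - 0.1 * total.acidity + rnorm(n, sd = 0.15)

# Pearson's product-moment correlation with a two-sided test of cor = 0.
res <- cor.test(total.acidity, pH, method = "pearson")

res$estimate   # sample correlation coefficient
res$conf.int   # 95 percent confidence interval
res$parameter  # degrees of freedom, n - 2
res$p.value
```

The degrees of freedom are n - 2, which is why the outputs above report df = 4896 for 4898 wines.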

Sugar, Alcohol, and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  wf$alcohol and wf$residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

What seems interesting here is that while plotting sugar on its own against quality does not show much of a correlation, plotting residual sugar against alcohol and then coloring by quality seems to show that higher quality wines, which tend to have higher alcohol contents, also tend to have lower sugar levels than wines with lower alcohol contents. Plotting sugar together with alcohol content made the relationship of each feature to quality easier to see.

The correlation between alcohol and sugar is -0.45, which is moderately strong. As alcohol increases, sugar levels tend to decrease, which confirms what we see in our plot. Perhaps this is a result of the yeast consuming more sugar during fermentation to produce more ethanol.

Sugar, Alcohol, Chlorides, and Density

## 
##  Pearson's product-moment correlation
## 
## data:  wf$residual.sugar and wf$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

## 
##  Pearson's product-moment correlation
## 
## data:  wf$alcohol and wf$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

## 
##  Pearson's product-moment correlation
## 
## data:  wf$chlorides and wf$density
## t = 18.624, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2308679 0.2831779
## sample estimates:
##       cor 
## 0.2572113

As expected, both alcohol content and residual sugar are highly correlated with density. If we were to create a linear regression model for quality, we should avoid having all three of these variables in the model, as multicollinearity would become a significant problem. Salt content is also correlated with density, though to a lesser extent than the other two features.
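One common way to quantify this kind of multicollinearity is the variance inflation factor (VIF). It is not something computed in this report, so the following is only an illustrative sketch. The predictors are synthetic: density is deliberately constructed from sugar and alcohol, with made-up coefficients, to mimic the strong correlations seen above.

```r
set.seed(7)
n <- 1000

# Synthetic predictors (made-up coefficients): density depends on both
# residual sugar and alcohol, plus a little noise.
residual.sugar <- rexp(n, rate = 1 / 6)
alcohol        <- rnorm(n, mean = 10.5, sd = 1.2)
density        <- 0.994 + 0.0004 * residual.sugar - 0.001 * alcohol +
                  rnorm(n, sd = 0.0005)
X <- data.frame(residual.sugar, alcohol, density)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all of the other predictors.
vif <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
round(vif, 1)
```

A VIF of 1 means a predictor is uncorrelated with the others. Values well above 5 or 10 are a common rule-of-thumb warning sign that the coefficient estimates will be unstable.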

Investigating salt content

## 
##  Pearson's product-moment correlation
## 
## data:  wf$chlorides and wf$residual.sugar
## t = 6.2299, df = 4896, p-value = 5.057e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06082916 0.11640188
## sample estimates:
##        cor 
## 0.08868454

## 
##  Pearson's product-moment correlation
## 
## data:  wf$chlorides and wf$alcohol
## t = -27.016, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3843183 -0.3355673
## sample estimates:
##        cor 
## -0.3601887

What is interesting is the inverse correlation between alcohol and chlorides, which I would not have expected. It seems that there are no wines with low alcohol content and low chloride levels and no wines with high alcohol content and high chloride levels. I am not sure why that is - perhaps it is a side effect of making wine with high alcohol content, or that high quality wines are produced with the goal of high alcohol content and low salt content in mind. Regardless, they are correlated, so we should bear that in mind while constructing a model so as to keep multicollinearity at a minimum.

Investigating sulfate levels

## 
##  Pearson's product-moment correlation
## 
## data:  wf$sulphates and wf$total.sulfur.dioxide
## t = 9.5019, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1069590 0.1619585
## sample estimates:
##       cor 
## 0.1345624
## 
##  Pearson's product-moment correlation
## 
## data:  wf$sulphates and wf$free.sulfur.dioxide
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03126264 0.08707928
## sample estimates:
##        cor 
## 0.05921725

## 
##  Pearson's product-moment correlation
## 
## data:  wf$free.sulfur.dioxide and wf$total.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

The level of sulphates in a wine does not seem to be very closely related to the amount of sulfur dioxide, whether total or free (correlations of 0.13 and 0.06 respectively). This is expected, because the readme for the dataset says that sulphate level contributes only a small amount to sulfur dioxide.

As expected, total sulfur dioxide and free sulfur dioxide are pretty strongly correlated.

Summary of findings so far

First, let us talk about the features other than the feature of interest that are correlated with each other. Some were obvious and expected, others are not:

- alcohol and density
- sugar and density
- chlorides and alcohol
- total sulfur dioxide and free sulfur dioxide
- all the various acidities with pH

Now, let us list how these features relate to quality:

- higher alcohol and higher quality
- lower sugar and higher quality
- lower total sulfur dioxide and higher quality
- lower acidity and higher quality
- lower salt content and higher quality

With this information, we can improve on our expectations of what makes for a high quality wine. Good wines tend to have higher alcohol contents, fruitier flavor (due to higher citric acid content), lower sugar levels, lower salt levels, lower sulfur dioxide levels, and lower overall acidity. I have left out features such as density, which is too strongly correlated with more important features such as alcohol content and residual sugar, and sulphates, which does not seem to be correlated with quality and is only very slightly correlated with total sulfur dioxide.

Quick Model

## 
## Calls:
## m1: lm(formula = alcohol ~ quality, data = wf)
## m2: lm(formula = alcohol ~ quality + volatile.acidity, data = wf)
## m3: lm(formula = alcohol ~ quality + volatile.acidity + chlorides, 
##     data = wf)
## m4: lm(formula = alcohol ~ quality + volatile.acidity + chlorides + 
##     residual.sugar, data = wf)
## m5: lm(formula = alcohol ~ quality + volatile.acidity + chlorides + 
##     residual.sugar + total.sulfur.dioxide, data = wf)
## 
## ============================================================================
##                           m1         m2         m3         m4         m5    
## ----------------------------------------------------------------------------
## (Intercept)            6.957***   6.166***    7.351***   8.085***   8.963***
##                       (0.106)    (0.123)     (0.127)    (0.114)    (0.117)  
## quality                0.605***   0.648***    0.567***   0.526***   0.493***
##                       (0.018)    (0.018)     (0.017)    (0.015)    (0.015)  
## volatile.acidity                  1.936***    2.043***   2.264***   2.365***
##                                  (0.158)     (0.150)    (0.132)    (0.127)  
## chlorides                                   -16.128*** -14.541*** -12.681***
##                                              (0.693)    (0.612)    (0.594)  
## residual.sugar                                          -0.098***  -0.076***
##                                                         (0.003)    (0.003)  
## total.sulfur.dioxide                                               -0.007***
##                                                                    (0.000)  
## ----------------------------------------------------------------------------
## R-squared                 0.190      0.214      0.292      0.452      0.495 
## adj. R-squared            0.190      0.214      0.292      0.451      0.495 
## sigma                     1.108      1.091      1.036      0.912      0.875 
## F                      1146.395    666.007    673.471   1008.012    959.723 
## p                         0.000      0.000      0.000      0.000      0.000 
## Log-likelihood        -7450.661  -7376.455  -7119.517  -6493.901  -6291.857 
## Deviance               6009.118   5829.768   5249.127   4065.773   3743.809 
## AIC                   14907.323  14760.910  14249.034  12999.801  12597.714 
## BIC                   14926.812  14786.896  14281.517  13038.781  12643.190 
## N                      4898       4898       4898       4898       4898     
## ============================================================================

Modeling

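In the end, using just a pretty basic linear model, we get an R-squared of 0.495 for m5, which is not too shabby. Note, however, that the models above use alcohol as the response and quality as a predictor, so this R-squared measures how well quality and the other features explain alcohol content, not how well they predict quality. Of course, this is far from a perfect model - a linear regression simply cannot capture all the subtleties of the data. The model also contains both chlorides and alcohol, which I already know are correlated with each other, so any version of it that uses them together as predictors of quality would suffer from some degree of multicollinearity, which makes the individual coefficient estimates less reliable.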
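The table above compares nested models, each adding one predictor to the previous one. A minimal sketch of that workflow, on synthetic data with made-up coefficients, could look like the following. The data frame toy and the models it produces are illustrative only and do not reproduce the numbers in the table.

```r
set.seed(3)
n <- 800

# Synthetic predictors and response (coefficients are made up).
quality        <- sample(3:9, n, replace = TRUE)
residual.sugar <- rexp(n, rate = 1 / 6)
alcohol        <- 8 + 0.5 * quality - 0.08 * residual.sugar + rnorm(n, sd = 0.9)
toy <- data.frame(alcohol, quality, residual.sugar)

# Nested models: m2 adds one predictor to m1.
m1 <- lm(alcohol ~ quality, data = toy)
m2 <- update(m1, . ~ . + residual.sugar)

# R-squared never decreases when a predictor is added; adjusted R-squared can.
r2 <- sapply(list(m1 = m1, m2 = m2), function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
})
round(r2, 3)

# F-test for whether the added predictor improves the fit.
anova(m1, m2)
```

Because plain R-squared cannot go down as predictors are added, the adjusted R-squared, AIC, or BIC rows are the safer ones to read when judging whether an extra feature earns its place.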


Final Plots and Summary

Plot One

Description One

This plot shows very obviously that there is definitely a trend towards higher alcohol content as wine quality increases. Just having a higher alcohol content seems to be a huge factor in determining wine quality - the entire boxplot moves up for each increase in quality level, which is not something I would have expected. It really makes me wonder why exactly alcohol is so strongly correlated with wine quality, and whether that bears out in real life. This plot sparked much of the exploration in regard to whether other features were strongly correlated with alcohol - is higher alcohol content a result of a general higher-quality wine making process, or is it purposefully sought after in the wine making process? I spent much of my time trying to explore this angle in this report.

Plot Two

Description Two

I selected these first two plots because they reveal quirks of the data that you wouldn’t have been able to see otherwise. During my EDA, it was hard to see whether sugar content related at all to wine quality - different levels of sugar content seemed to be distributed quite evenly across all wine qualities. However, this plot immediately reveals two things: (1) higher quality wines do, in fact, have lower sugar levels, and (2) there are no high alcohol and high sugar content wines. The insights this plot offered me meant I now was willing to use residual sugar as a feature in the linear regression model I hoped to build, since it was clearly correlated with wine quality. And this paid off - adding residual.sugar to my linear model raised the R-squared value (unadjusted) from 0.292 to 0.452.

Plot Three

Description Three

This reveals a relationship between features that I had not expected at all. For some reason, alcohol content seems to be inversely correlated with salt content - and high quality wines are overwhelmingly concentrated in the area of the plot where salt content is low and alcohol percentage is high.


Reflection

This project was intimidating at first because there were so many features. Which ones should I concentrate on? Which ones would actually have any effect? And once I plotted the distributions of each with regard to quality, I did not come out as enlightened as I had thought I would be - only alcohol and perhaps salt seemed to contribute in any way to wine quality. This was unlike the diamond data set, in which features were fewer and there were universally defined metrics for what made a better diamond.

Still, there were a few common sense hunches that I had regarding what would affect wine quality - I feel that oftentimes, our own intuition is where we begin in such investigations, and in the process of confirming or invalidating those intuitions, we discover new quirks and trends that would not have occurred to us without such exploration. That is what happened with me - I felt that sugar levels and acidity ought to have some significant effect on wine quality.

After trying various plots with sugar levels, I was about ready to give up. There seemed to be no rhyme nor reason with sugar content across different quality wines. However, when I finally plotted sugar vs. alcohol content and colored the points by quality, sugar’s inverse relationship with wine quality finally revealed itself. Needless to say, I was pleased. However, this plot also revealed sugar’s inverse correlation with alcohol, which made me wonder why exactly would there be a relationship between alcohol and sugar? Is it because of the fermentation process that converts sugar into ethanol, and therefore the higher the alcohol content, the lower the sugar level?

This induced me to investigate further the relationship between alcohol and other features, and I found, to my surprise, that chlorides and alcohol were also inversely correlated. Wines with high alcohol contents also had low salt contents and were generally rated higher than wines with low alcohol contents and high salt contents, which were generally rated lower. In fact, there seemed to be a relative dearth of wines that had both high alcohol and high salt contents as well as both low alcohol and low salt contents. This raises the same question that the discovery of sugar’s relationship with alcohol evoked: was this a result of the wine making process that naturally meant high quality wines had high alcohol contents and low salt contents, or was this due to wine makers purposely choosing to make wines with these characteristics? I do not think this is a question that can be answered with EDA alone - it would require an understanding of the wine making process as well.

Once I got the ball rolling in mixing and matching features to see if anything strange and interesting popped out, it was a relatively straightforward process to see how acidity related to wine quality. Strangely enough, it turned out that higher citric acid was correlated with higher wine quality even though overall acidity (as measured by pH and my total acidity variable) was correlated with lower wine quality. I attributed this to higher citric acid levels making wines taste fruitier. Also, total acidity was dominated by fixed acidity - citric acid was a small enough component of total acidity that its level was nearly negligible in determining pH, so this was actually a finding that made sense. A wine that is too acidic probably tastes bad, but a fruitier wine tastes better.
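The claim that fixed acidity dominates total acidity can be checked with a little arithmetic on the column means from the summary(wf) output near the top of this report. This is a back-of-the-envelope sketch that treats total acidity as the plain sum of the three measures, as defined earlier.

```r
# Column means copied from the summary(wf) output earlier in this report.
means <- c(fixed = 6.855, volatile = 0.2782, citric = 0.3342)

# Share of each component in the mean total acidity
# (the mean of a sum equals the sum of the means).
share <- means / sum(means)
round(100 * share, 1)
# fixed 91.8, volatile 3.7, citric 4.5 (percent)
```

On average, citric acid accounts for under 5 percent of the total acidity variable, while fixed acidity accounts for over 90 percent.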

Looking forward

There are still many things that could be done. There are some combinations of features that I have not plotted - namely, sulphates and sulfur dioxide levels against density - and I do not know whether they would change anything in my analysis. Perhaps using more boxplots would also reveal some interesting things.

Also, if I were to spend more time on this, I would likely create more robust models for predicting wine quality - using naive Bayes, or vector models, or a logistic regression. My linear model had decent results, but there is clearly room for a better one.

I would like to actually compare white wines with red wines - there would probably be a lot of interesting insights into the character of these two wines, in terms of their various acidities, alcohol contents, sugar levels, etc., and what makes for a high quality red or white wine.